Open-world evaluations for measuring frontier AI capabilities
Introducing CRUX, a new project for evaluating AI on long, messy tasks
Your hub for AI Evaluation news and research — curated daily from 50 top AI sources including OpenAI, Anthropic, Google DeepMind, and more. Every article is reviewed and enriched with editorial analysis by the DeepTrendLab team.
Note: you are ineligible to complete this challenge if you’ve studied Ancient or Modern Greek, or if you natively speak Modern Greek, or if for other reasons you know what…
LLM-based chatbots’ capabilities have been advancing every month. These improvements are mostly measured by benchmarks like MMLU, HumanEval, and MATH (e.g. Sonnet 3.5, GPT-4o). However, as these measures get more…
What spending $2,000 can tell us about evaluating AI agents